• Hadoop Pseudo-Distributed Operation
• Build Platform
• Build Preparation & Writing the Dockerfile
• Building the Image
• Running the Container
• Stopping the Container
• Reference
Hadoop Pseudo-Distributed Operation
Hadoop can be deployed in a Docker container with little effort. This guide uses a Dockerfile to build an image that quickly provides a runnable Hadoop framework deployed in single-machine pseudo-distributed mode.
Build Platform
The image is built under the Windows Subsystem for Linux (WSL).
• System information:
_____________________________________________________
| o o o powershell |
|=====================================================|
| > wsl --version |
| WSL version: 2.3.26.0 |
| Kernel version: 5.15.167.4-1 |
| WSLg version: 1.0.65 |
| MSRDC version: 1.2.5620 |
| Direct3D version: 1.611.1-81528511 |
| DXCore version: 10.0.26100.1-240331-1435.ge-release |
| Windows version: 10.0.19045.5198 |
'====================================================='
• Docker version:
______________________________________________________________
| o o o bash |
|==============================================================|
| $ sudo docker version |
| Client: Docker Engine - Community |
| Version: 27.3.1 |
| API version: 1.47 |
| Go version: go1.22.7 |
| Git commit: ce12230 |
| Built: Fri Sep 20 11:41:00 2024 |
| OS/Arch: linux/amd64 |
| Context: default |
| |
| Server: Docker Engine - Community |
| Engine: |
| Version: 27.3.1 |
| API version: 1.47 (minimum version 1.24) |
| Go version: go1.22.7 |
| Git commit: 41ca978 |
| Built: Fri Sep 20 11:41:00 2024 |
| OS/Arch: linux/amd64 |
| Experimental: false |
| containerd: |
| Version: 1.7.23 |
| GitCommit: 57f17b0a6295a39009d861b89e3b3b87b005ca27 |
| runc: |
| Version: 1.1.14 |
| GitCommit: v1.1.14-0-g2c9f560 |
| docker-init: |
| Version: 0.19.0 |
| GitCommit: de40ad0 |
'=============================================================='
Build Preparation & Writing the Dockerfile
The build is based on the ubuntu:latest image maintained by Canonical. Pass --build-arg BASE_VERSION=<chosen version> to the build command to select a different image version, or hard-code the change in the Dockerfile.
During the build, the apt package sources are replaced to speed up package downloads. To replace them, prepare an apt source configuration file in DEB822 format in advance, name it ubuntu.sources, and place it in the build directory.
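For reference, a minimal sketch of what ubuntu.sources could look like, assuming the Tsinghua (Tuna) mirror that the Dockerfile comments reference and the suite names of the release that ubuntu:latest currently resolves to; adjust the suites, components, and keyring path to match your base image:
# Sketch only: suite names assume Ubuntu 24.04 (noble); change them to match the base image.
Types: deb
URIs: https://mirrors.tuna.tsinghua.edu.cn/ubuntu
Suites: noble noble-updates noble-backports noble-security
Components: main restricted universe multiverse
Signed-By: /usr/share/keyrings/ubuntu-archive-keyring.gpg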
At the same time, the Hadoop binary distribution can be added to the build context in any of the following ways:
• Download the binary distribution directly from the Apache-maintained download site and place it in the same directory as the Dockerfile.
• Download the binary distribution from a mirror maintained by a domestic organization, for example the apache-hadoop area of the Alibaba Cloud open-source mirror site (mirrors.aliyun.com), and place it in the same directory as the Dockerfile.
• Uncomment one of the ADD instructions so that the archive is pulled from either of the two sources above at build time.
The Hadoop version is hard-coded in the Dockerfile (3.4.1, released 2024-10-18). It can be changed to any other release labelled 3.n.n, and the build should still go through.
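If you do switch to a different release, the SHA-512 value baked into the Dockerfile (the HADOOP341_SUM build argument below) must match the new tarball, and the hadoop-3.4.1 file and directory names must be adjusted accordingly. A minimal sketch of checking and overriding the sum, using the 3.4.1 tarball as the example:
# Compute the SHA-512 of the locally downloaded tarball
sha512sum hadoop-3.4.1.tar.gz
# Paste the value into the HADOOP341_SUM ARG, or override it at build time:
sudo docker build --build-arg HADOOP341_SUM=<sum printed above> --tag impipya/hadoop_single_node .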
The complete, directly buildable Dockerfile source follows:
ARG BASE_VERSION=latest
ARG WORKDIR=/opt
FROM ubuntu:${BASE_VERSION}
# See: https://docs.docker.com/reference/dockerfile/#understand-how-arg-and-from-interact
ARG WORKDIR
LABEL Version="0.1"
SHELL ["/bin/bash", "-exc"]
WORKDIR ${WORKDIR}
# Init: Prepare file
# ==============================================================================
# Update Apt
## ca-certificates: needed before switching to an HTTPS apt mirror
RUN <<-DOC
apt-get -qq update
apt-get -qq install ca-certificates
DOC
# Change the apt source to the Tuna mirror
# See: https://mirrors.tuna.tsinghua.edu.cn/help/ubuntu/
COPY ubuntu.sources /etc/apt/sources.list.d/ubuntu.sources
## Install dependencies
RUN <<-DOC
apt-get -qq update
apt-get -qq install openjdk-11-jdk ssh pdsh xmlstarlet sudo
apt-get clean
DOC
# Check Sum
ARG HADOOP341_SUM=09cda6943625bc8e4307deca7a4df76d676a51aca1b9a0171938b793521dfe1ab5970fdb9a490bab34c12a2230ffdaed2992bad16458169ac51b281be1ab6741
# Download Release
## Apache mirror
#ADD --checksum=sha512:${HADOOP341_SUM} https://dlcdn.apache.org/hadoop/common/hadoop-3.4.1/hadoop-3.4.1.tar.gz .
## Aliyun mirror
#ADD --checksum=sha512:${HADOOP341_SUM} https://mirrors.aliyun.com/apache/hadoop/common/hadoop-3.4.1/hadoop-3.4.1.tar.gz .
## Copy From Downloaded
COPY hadoop-3.4.1.tar.gz .
RUN <<-DOC
echo "${HADOOP341_SUM} ./hadoop-3.4.1.tar.gz" > ./hadoop-3.4.1.tar.gz.sha512
sha512sum -c ./hadoop-3.4.1.tar.gz.sha512
rm ./hadoop-3.4.1.tar.gz.sha512
DOC
# Extract Tar
RUN <<-DOC
tar -xf ./hadoop-3.4.1.tar.gz
rm ./hadoop-3.4.1.tar.gz
DOC
# Install done: then start configure
# ==============================================================================
# Configure Environment
ENV JAVA_HOME=/usr/lib/jvm/java-11-openjdk-amd64
ENV HADOOP_HOME=${WORKDIR}/hadoop-3.4.1
ENV PATH=${PATH}:${HADOOP_HOME}/bin:${HADOOP_HOME}/sbin
RUN <<-DOC
# Configure core-site.xml and hdfs-site.xml with xmlstarlet.
# Each pipeline writes to a temporary file first and then moves it back:
# redirecting the pipeline into the same file it reads from would truncate it.
cat ${HADOOP_HOME}/etc/hadoop/core-site.xml |
xmlstarlet ed -s '/configuration' -t elem -n property |
xmlstarlet ed -s '/configuration/property[1]' -t elem -n name -v fs.defaultFS |
xmlstarlet ed -s '/configuration/property[1]' -t elem -n value -v hdfs://localhost:9000 > /tmp/core-site.xml
mv /tmp/core-site.xml ${HADOOP_HOME}/etc/hadoop/core-site.xml
cat ${HADOOP_HOME}/etc/hadoop/hdfs-site.xml |
xmlstarlet ed -s '/configuration' -t elem -n property |
xmlstarlet ed -s '/configuration/property[1]' -t elem -n name -v dfs.replication |
xmlstarlet ed -s '/configuration/property[1]' -t elem -n value -v 1 |
xmlstarlet ed -s '/configuration' -t elem -n property |
xmlstarlet ed -s '/configuration/property[2]' -t elem -n name -v dfs.permissions.enabled |
xmlstarlet ed -s '/configuration/property[2]' -t elem -n value -v false > /tmp/hdfs-site.xml
mv /tmp/hdfs-site.xml ${HADOOP_HOME}/etc/hadoop/hdfs-site.xml
sed -i -e "s|# export JAVA_HOME=|export JAVA_HOME=/usr/lib/jvm/java-11-openjdk-amd64|" ${HADOOP_HOME}/etc/hadoop/hadoop-env.sh
DOC
# Create Init Script
RUN <<-DOC
cat <<- SCRIPT > /usr/local/bin/start_init.sh
#!/bin/bash
sudo sshd
echo start_init done!
/bin/bash
SCRIPT
chmod 775 /usr/local/bin/start_init.sh
DOC
# Other
RUN <<-DOC
# See: https://askubuntu.com/questions/1379425/system-has-not-been-booted-with-systemd-as-init-system-pid-1-cant-operate
mkdir /run/sshd # Manual Launch
# See: https://askubuntu.com/questions/1379425/system-has-not-been-booted-with-systemd-as-init-system-pid-1-cant-operate
#[ -e /etc/hostname ] && echo singlenode > /etc/hostname
#hostnamectl set-hostname singlenode
apt-get install -qq vim
DOC
# Setup done: then create a hadoop user
# ==============================================================================
RUN <<-DOC
groupadd hadoop
cat <<-CONFIG > /etc/sudoers.d/hadoop
%hadoop ALL=(ALL) NOPASSWD: ALL
CONFIG
# Use -m to ensure the home directory is created
useradd -m singlenode -s /bin/bash -G hadoop
echo "singlenode:password" | chpasswd
DOC
# Change User
USER singlenode
WORKDIR /home/singlenode
# Configure SSH
RUN <<-DOC
ssh-keygen -t rsa -P '' -f ~/.ssh/id_rsa
cat ~/.ssh/id_rsa.pub >> ~/.ssh/authorized_keys
chmod 0600 ~/.ssh/authorized_keys
DOC
# Make the Hadoop directory accessible to the hadoop group
RUN <<-DOC
sudo chmod 770 ${HADOOP_HOME}
sudo chown :hadoop ${HADOOP_HOME}
DOC
# Test
# ==============================================================================
RUN <<-DOC
ls -l /usr/local/bin/start_init.sh
start_init.sh # run init
echo test_java_home: $JAVA_HOME
echo test_hadoop_path: $HADOOP_HOME
echo test_path: $PATH
hadoop version
hdfs namenode -format
start-dfs.sh
# start-yarn.sh
hdfs dfs -mkdir -p /user/singlenode
echo 'test contents' > test.txt
hdfs dfs -put test.txt .
jps
stop-all.sh
DOC
CMD start_init.sh
#RUN 'DEBUG USE BUILD SUCCESS'
Building the Image
The build context is the current directory; the build script is as follows:
#!/bin/bash
# See: https://docs.docker.com/reference/cli/docker/buildx/build/
# --progress, --tag -t
sudo docker build \
--progress=plain \
--tag impipya/hadoop_single_node \
.
In the build script, the name of the generated image can be changed via --tag <maintainer>/<repository>.
Note that the same change must then be made in the launch script (see below) so the two stay consistent.
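For example, a variant of the build script that pins the base image version (via the BASE_VERSION build argument declared in the Dockerfile) and uses a custom tag could look like this; the tag value is only illustrative:
#!/bin/bash
# Build against a pinned ubuntu tag and a custom image name
sudo docker build \
    --progress=plain \
    --build-arg BASE_VERSION=24.04 \
    --tag yourname/hadoop_single_node \
    .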
Running the Container
The script that launches the container is as follows:
#!/bin/bash
echo 'Note: Add "127.0.0.1 singlenode" to /etc/hosts to make the datanode connectable.'
# See: https://docs.docker.com/reference/cli/docker/container/run/
# --tty -t, --interactive -i, --publish -p, --hostname -h, --name
sudo docker run \
--tty \
--interactive \
--publish 9870:9870 \
--publish 9864:9864 \
--hostname singlenode \
--name hadoop_single_node \
impipya/hadoop_single_node
The last argument to docker run is the image name; it must match the name of the image that was built.
When the container is run, ports 9870 and 9864 are published so that they can be reached from the local machine:
• 9870: the HDFS (NameNode) Web UI port.
• 9864: the DataNode Web UI / data transfer port.
Also note that because Hadoop addresses the DataNode by hostname, a mapping from that hostname to an IP address must be configured on the host machine, so that the HDFS Web UI can browse the files stored on the DataNode. This can be done through the hosts file.
File location:
• Microsoft Windows: C:\Windows\System32\drivers\etc\hosts
• Unix: /etc/hosts
Append the following line to the end of the file:
127.0.0.1 singlenode
The hosts change takes effect as soon as the file is saved (on both kinds of system); nothing else needs to be done.
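On Unix-like systems (including WSL) the mapping can be appended from a shell; a small sketch, assuming sudo rights (on Windows, edit the file in an editor running as Administrator instead):
# Append the hostname mapping to /etc/hosts; it takes effect immediately
echo '127.0.0.1 singlenode' | sudo tee -a /etc/hosts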
Run the launch script:
./start.sh
After the container starts, the terminal looks like this:
___________________________________________________________________________________
| o o o bash |
|===================================================================================|
| $ ./start.sh |
| Note: Add "127.0.0.1 singlenode" to /etc/hosts to make the datanode connectable.  |
| Note: Modify proxy setting to ignore http://singlenode/*. |
| + start_init.sh |
| start_init done! |
| singlenode@singlenode:~$ |
'==================================================================================='
The Hadoop pseudo-distributed configuration was already completed while building the image (including formatting the HDFS filesystem), so HDFS can be started directly for testing:
start-dfs.sh
___________________________________________
| o o o bash |
|===========================================|
| singlenode@singlenode:~$ start-dfs.sh |
| Starting namenodes on [localhost] |
| Starting datanodes |
| Starting secondary namenodes [singlenode] |
'==========================================='
Use jps to list the running Java virtual machines and check whether the daemons have started:
______________________________
| o o o bash |
|==============================|
| singlenode@singlenode:~$ jps |
| 615 Jps |
| 249 DataNode |
| 138 NameNode |
| 460 SecondaryNameNode |
'=============================='
Now open a browser; any of the following URLs opens the HDFS Web UI:
• localhost:9870
• 127.0.0.1:9870
• singlenode:9870
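Besides the Web UI, HDFS can also be checked from the command line inside the container. A short sketch; the test.txt file was uploaded to /user/singlenode during the Dockerfile's test stage, so it should already be listed:
# Inside the container, as the singlenode user, with HDFS running
hdfs dfs -ls /user/singlenode                # should show test.txt from the build-time test
hdfs dfs -cat /user/singlenode/test.txt      # prints "test contents"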
Stopping the Container
Importantly, before shutting down the container, run:
stop-all.sh
This command shuts down all Hadoop daemons in a controlled, safe way. If the container is simply killed instead, some daemons may fail to start the next time.
If that happens, try clearing Hadoop's cached files under /tmp/hadoop, re-formatting HDFS, and then starting HDFS again to see whether all daemons come up.
The command to format HDFS:
hdfs namenode -format
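Put together, a recovery sketch; the path below assumes the default hadoop.tmp.dir (/tmp/hadoop-${user.name}), and clearing it erases everything stored in HDFS:
# Make sure nothing is still running, clear Hadoop's temporary data,
# re-format the namenode, then start HDFS again
stop-all.sh
rm -rf /tmp/hadoop-singlenode
hdfs namenode -format
start-dfs.sh
jps        # NameNode, DataNode and SecondaryNameNode should all be listed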
Reference
• Docker Docs
• Apache Hadoop 3.4.1 – Hadoop: Setting up a Single Node Cluster
Create: Thu Dec 12 22:42:08 2024
Last Modified: Thu Dec 12 22:42:08 2024